ABSTRACT

This paper presents an algorithm to automatically design
networks with torus topologies, such as the ones widely used
in large-scale supercomputers. The characteristic feature of
our approach is that real-life equipment prices and values
of technical characteristics are used. As a result, we can
also compare the costs of torus and fat-tree networks.

The algorithm is useful as a part of a bigger design procedure
that selects optimal hardware for a cluster supercomputer as a
whole.

Categories and Subject Descriptors 

C.2.1 [Computer-Communication Networks]: Network 
Architecture and Design — Network topology; K.6.2 [Mana- 
gement of Computing and Information Systems]: In- 
stallation Management — Computer selection 

General Terms 

Design, Economics 

Keywords 

Torus network 

1. INTRODUCTION 

Torus networks are frequently used in large-scale supercom- 
puters as a cost-efficient alternative to other topologies. Re- 
cently it was demonstrated that torus networks for computer 
clusters can be built from affordable commodity hardware 
such as InfiniBand. 

We describe an algorithm that automatically designs torus
networks. The algorithm is implemented in a software
tool [8].

This algorithm is intended to be used as a part of a CAD 
system for cluster supercomputers [9]. Such a system would 
iterate through different combinations of hardware, vary- 
ing the number of compute nodes and other parameters. 
Thus, designing an interconnection network for every hard- 
ware combination under review is a self-contained and highly 
repetitive operation that must be performed efficiently. 

During the design process, other characteristics of intercon- 
nection networks, such as reliability, can be estimated and 
used as design constraints or as a part of a complex objective 
function. 



The rest of the article is organized as follows. Section 2
describes relevant work in the field of torus networks.
Section 3 details the design of the "Gordon" supercomputer.
Section 4 introduces the algorithm, and section 5 compares
costs of torus networks with fat-trees. Finally, section 6
concludes the article.

2. RELATED WORK 

Torus networks have found widespread use in supercomput- 
ing. IBM used a 3D torus network in BlueGene/L, and a 5D 
network [3] in BlueGene/Q. A 6D mesh-torus network was 
used in "K Computer" [1]. Both are direct networks, where 
compute nodes are connected directly to their neighbours, 
as opposed to switched fabrics, where nodes are first con- 
nected to switches, and then switches are connected to each 
other in a torus topology. The example of the latter is a 3D 
torus network of the Gordon supercomputer [11]. 

Torus networks are inherently prone to congestion, but
designers mitigate this by increasing the number of
dimensions. Commenting on the Gordon project, Strande [10]
quotes the following benefits of torus networks: (a) lower
cost compared to fat-trees and (b) easy linear scaling along
one of the dimensions. However, such scaling may result in
unbalanced topologies, leading to bigger latencies and higher
congestion on the links in that dimension. Strande also
mentions that the torus topology uses short cables, which
makes the use of fibre-optic cables unnecessary, leading to
further cost savings.

Navaridas and Miguel-Alonso [6] analysed the performance of
2D switch-based torus topologies and fat-trees for up to 7680
compute nodes, on a range of workloads, using simulation
techniques. They conclude that performance degradation
from using torus networks, compared to fat-trees, can reach
20 to 40%, and sometimes more, on communication-intensive
workloads, which limits the applicability of tori in larger
installations.

Camara et al. [2] introduced a technique to turn unbalanced
rectangular 2D and 3D tori into twisted tori by rearranging
peripheral links, which improves performance characteristics
and restores network symmetry.

3. 3D DUAL-RAIL TORUS NETWORK OF 
THE GORDON SUPERCOMPUTER 

The Gordon supercomputer [11] uses InfiniBand switches with
P = 36 ports of 4X QDR technology. Switches form a 4x4x4
torus; each switch has 6 neighbours, and connects to each of
them with 3 links, thereby utilizing 18 ports out of 36.
17 more ports are used to connect 16 compute nodes and one
I/O node.

The network is dual-rail, therefore there are actually two 
tori made of switches, and compute and I/O nodes have two 
network interfaces, one of which is used to connect to the 
switch in the first torus ("rail"), and the other to the second 
one. Currently, one rail is used for MPI, and the other one 
for I/O traffic. According to Strande [10], there are plans to 
use both rails simultaneously to provide failover capabilities 
and improve bandwidth. 
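
The port budget described above can be checked with a few lines of arithmetic. The sketch below is only an illustration; the function name `switch_ports_used` and its parameters are hypothetical and do not come from the Gordon design documents.

```python
def switch_ports_used(dimensions: int, links_per_neighbour: int,
                      attached_nodes: int) -> int:
    """Ports occupied on one switch of a switched torus.

    Assumes every dimension has at least 3 switches, so that each switch
    has two distinct neighbours per dimension.
    """
    neighbours = 2 * dimensions
    return neighbours * links_per_neighbour + attached_nodes


# Gordon, as described above: 3D torus, 3 links per neighbour,
# 16 compute nodes plus 1 I/O node per switch.
used = switch_ports_used(dimensions=3, links_per_neighbour=3, attached_nodes=17)
print(used)  # 35 of the 36 available ports
```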

4. ALGORITHM FOR DESIGNING TORUS 
NETWORKS 

We propose an algorithm to calculate the number of switches
in a torus network, using as input the number of compute
nodes to be interconnected and, optionally, a blocking
factor that determines the distribution of ports on a switch
between compute nodes and neighbouring switches. The
algorithm is suitable for designing networks built with
commodity hardware, such as Gordon's network.

As torus networks are inherently prone to congestion,
imposing additional blocking at the switch level is very
disadvantageous. However, sometimes blocking is stipulated by
the hardware manufacturer, and cannot be avoided. For
example, in [6] the hardware under review was a blade chassis
equipped with N = 20 compute nodes and an InfiniBand
switch with P = 36 ports. Only 16 ports of the switch were
used to connect it to the outside world, which resulted in
a blocking factor of Bl = 20/16 = 1.25. In order to build
torus networks for such hardware with the proposed algorithm,
we need to specify the blocking factor as an input.
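
To make the relationship between the blocking factor and the port split concrete, here is a minimal sketch of the calculation performed in lines 8 to 10 of Algorithm 1. The function name `split_ports` is hypothetical, rounding the node-facing port count down is an assumption, and exact fractions are used only to avoid floating-point surprises when the result is a whole number.

```python
from fractions import Fraction
from math import floor


def split_ports(total_ports: int, blocking: Fraction) -> tuple[int, int, Fraction]:
    """Split switch ports between compute nodes and neighbouring switches.

    Returns (ports_to_nodes, ports_to_switches, resulting_blocking).
    Rounding down is an assumption of this sketch.
    """
    ports_to_nodes = floor(total_ports * blocking / (1 + blocking))
    ports_to_switches = total_ports - ports_to_nodes
    resulting = Fraction(ports_to_nodes, ports_to_switches)
    return ports_to_nodes, ports_to_switches, resulting


# Blade chassis example from the text: 36 ports, Bl = 20/16 = 1.25.
print(split_ports(36, Fraction(5, 4)))
# Non-blocking case: the ports are split evenly.
print(split_ports(36, Fraction(1)))
```

For the blade chassis this yields 20 node-facing and 16 switch-facing ports, matching the hardware described above.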

The algorithm tries to build a network using identical
switches with $P_E$ ports. Let us describe the algorithm by
stages. In line 1 we check whether a single switch has enough
ports to connect all N nodes. If so, we use the star topology
with only one switch and exit.

Otherwise, we will build a ring or a torus. In lines 8 to 10
we calculate the number of switch ports that go to compute
nodes and to the neighbouring switches, and then recalculate
the blocking factor for the network. On line 11 we derive
the minimal number of switches required to connect N nodes
with a given blocking factor. The actual torus will contain
slightly more switches (generally, the increase is within 20%
for small networks, and within several percent for the large
ones).

On line 12, we use a heuristic to determine the number of
torus dimensions, based on the number of switches. It is
important to note that there are no hard rules for choosing
the number of dimensions. Choosing a low number of
dimensions for a high number of compute nodes leads to
increased network diameter and therefore latencies. On the
other hand, choosing too many dimensions for a low number of
compute nodes does not provide network performance benefits
but results in complex cabling patterns. In the case of
direct networks this scenario also requires network adapters
with an unnecessarily large number of ports.
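
The effect of the dimension count on the diameter can be illustrated with a standard property of tori: with wrap-around links, the farthest switch along a dimension of size d is floor(d/2) hops away. The sketch below counts switch-to-switch hops only; the function name `torus_diameter` is hypothetical.

```python
def torus_diameter(dims: list[int]) -> int:
    """Diameter, in inter-switch hops, of a torus (a ring if one dimension).

    Distances along separate dimensions add up, and the longest shortest
    path along a ring of d switches is d // 2 hops.
    """
    return sum(d // 2 for d in dims)


# The same 64 switches arranged with an increasing number of dimensions.
for layout in ([64], [8, 8], [4, 4, 4]):
    print(layout, torus_diameter(layout))
```

For 64 switches the diameter drops from 32 hops in a ring to 8 in a 2D torus and 6 in a 3D torus, which is why larger networks call for more dimensions.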



The optimal number of dimensions depends on the com- 
munication pattern of the application, and can be reliably 
determined, for any given application, only through bench- 
marking on real hardware or by using simulation such as in 
[6]. Therefore we relied on using a heuristic. 

Currently, the dimension choice heuristic returns the number 
of dimensions as per Table 1, up to D = 5. The layout of 
switches in the maximal configuration for that number of 
dimensions is provided in the last column of the table for 
reference. 

If the heuristic returns D = 1, then we use the ring topology
(line 14). Otherwise, we use the torus topology, and need to
calculate the number of switches along each of the first
D - 1 dimensions by rounding $\sqrt[D]{E}$ to the nearest
integer (line 17).



Algorithm 1 Design a torus network
Input:
    $N$: Number of nodes to interconnect
    $Bl$: Blocking factor
    $P_E$: Number of switch ports
Goal: Optimal network structure:
    $D$: Number of torus dimensions
    $d = (d_1, \ldots, d_D)$: Number of switches along each dimension
    $E$: Total number of switches
    $Bl_r$: Resulting blocking factor
    $L$: Number of cables
    $f$: Objective function for the optimal network structure
 1: if $P_E \ge N$ then
 2:     {If there exists a switch with N or more ports}
 3:     print Topology: star
 4:     $E \leftarrow 1$; $Bl_r \leftarrow 1$; $L \leftarrow N$
 5:     Compute $f$
 6:     Exit
 7: end if
 8: $P_{En} \leftarrow \lfloor P_E \cdot (Bl / (1 + Bl)) \rfloor$ {Ports to nodes}
 9: $P_{Es} \leftarrow P_E - P_{En}$ {Ports to other switches}
10: $Bl_r \leftarrow P_{En} / P_{Es}$ {Resulting blocking}
11: $E \leftarrow \lceil N / P_{En} \rceil$ {Minimal number of switches}
12: $D \leftarrow$ GetDimCount($E$) {Heuristic for the number of torus dimensions}
13: if $D = 1$ then
14:     print Topology: ring
15: else
16:     print Topology: torus
17:     $d_i \leftarrow \mathrm{round}(\sqrt[D]{E})$, $i = 1 \ldots D - 1$ {Number of switches along dimensions}
18:     $d_D \leftarrow \lceil E / d_1^{D-1} \rceil$ {Switches in the last dimension}
19:     $E \leftarrow \prod_{i=1}^{D} d_i$ {Actual number of switches}
20: end if
21: $L \leftarrow N + E \cdot P_{Es} / 2$ {Number of cables}
22: Compute $f$



This creates a topology close to an ideal square, cube, etc.
Packaging constraints, however, may preclude the use of
this particular ideal layout, and in the resulting unbalanced
torus the number of switches along dimensions may differ
significantly. The number of switches, E, still remains the
same as returned by the algorithm, so equipment cost and
other metrics can still be calculated correctly.



Switch count, E    Topology   Dimensions, D   Max. configuration
2 or 3             Ring       1
up to 36           Torus      2               6x6
up to 125          Torus      3               5x5x5
up to 2401         Torus      4               7x7x7x7
more than 2401     Torus      5               (As appropriate)

Table 1: Heuristic for the number of torus dimensions


Compute nodes, N   Dimensions, D   Torus topology   Supercomputer of comparable size
1,000              3               4x4x4            Gordon [11]
6,000              4               4x4x4x6          Stampede [12]
8,000              4               5x5x5x4          Tianhe-1A [5]
10,000             4               5x5x5x5          SuperMUC [4]
19,000             4               6x6x6x5          Titan [7]

Table 2: Sample output for Algorithm 1



In the next step, we calculate the number of switches in the
last dimension (line 18) and recalculate the total number of
switches as the product of switch counts along all dimensions
(line 19).
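
As a worked example of lines 17 to 19, consider E = 334 switches and D = 4. These values arise for N = 6,000 nodes when 18 ports per switch face compute nodes, which is an assumption matching the non-blocking 36-port case:

```latex
\begin{align*}
d_i &= \operatorname{round}\left(\sqrt[4]{334}\right)
     = \operatorname{round}(4.27\ldots) = 4, \quad i = 1, 2, 3, \\
d_4 &= \left\lceil 334 / 4^{3} \right\rceil
     = \lceil 5.21\ldots \rceil = 6, \\
E   &= 4 \cdot 4 \cdot 4 \cdot 6 = 384.
\end{align*}
```

The resulting 4x4x4x6 torus has 384 switches, about 15% more than the minimum of 334.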

The number of cables is determined on line 21. The number
of switch ports facing neighbouring switches, $P_{Es}$, is
divided by two, because two ports are connected with one
cable. This is then multiplied by the number of switches
E. Compute nodes are connected with additional N cables.
The network is expandable from N up to $E \cdot P_{En}$
compute nodes. Inter-switch links run in bundles of
approximately $P_{Es} / (2 \cdot D)$ links, therefore it is
often possible to use cables that integrate several links
(such as a 12x InfiniBand cable that integrates three 4x
links) to reduce the number of physical cables, simplifying
installation.
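
For illustration, take a 4x4x4 torus of E = 64 switches connecting N = 1,000 compute nodes, with D = 3, and write $P_{Es}$ for the number of switch-facing ports per switch, here 18. These values are chosen to match the non-blocking 36-port case and are not taken from a vendor configuration:

```latex
\begin{align*}
L &= N + E \cdot P_{Es} / 2 = 1000 + 64 \cdot 18 / 2 = 1576, \\
\text{links per bundle} &= P_{Es} / (2 \cdot D) = 18 / (2 \cdot 3) = 3.
\end{align*}
```

With three 4x links per bundle, using one 12x cable per bundle would reduce the 576 inter-switch cables to 192 physical cables.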

Sample output of the algorithm for commodity InfiniBand
switches with $P_E = 36$ ports and a non-blocking network
($Bl = 1$) is presented in Table 2.
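
The stages described in this section can be sketched in Python as follows. This is an illustrative reading of Algorithm 1, not the implementation from the tool [8]: the names `TorusDesign`, `get_dim_count` and `design_torus` are hypothetical, rounding the node-facing port count down is an assumption, and the objective function is omitted.

```python
from dataclasses import dataclass
from fractions import Fraction
from math import ceil, floor, prod


@dataclass
class TorusDesign:
    topology: str          # "star", "ring" or "torus"
    dims: tuple[int, ...]  # switches along each dimension
    switches: int          # E
    blocking: Fraction     # resulting blocking factor
    cables: int            # L


def get_dim_count(switches: int) -> int:
    """Dimension heuristic following Table 1."""
    for limit, dim_count in ((3, 1), (36, 2), (125, 3), (2401, 4)):
        if switches <= limit:
            return dim_count
    return 5


def design_torus(nodes: int, blocking: Fraction = Fraction(1),
                 ports: int = 36) -> TorusDesign:
    # Lines 1-7: a single switch suffices, so use a star.
    if ports >= nodes:
        return TorusDesign("star", (1,), 1, Fraction(1), nodes)
    # Lines 8-10: split ports between nodes and neighbouring switches
    # (rounding down is an assumption of this sketch).
    ports_to_nodes = floor(ports * blocking / (1 + blocking))
    ports_to_switches = ports - ports_to_nodes
    resulting_blocking = Fraction(ports_to_nodes, ports_to_switches)
    # Line 11: minimal number of switches.
    switches = ceil(Fraction(nodes, ports_to_nodes))
    # Line 12: heuristic for the number of dimensions.
    dim_count = get_dim_count(switches)
    if dim_count == 1:
        dims = (switches,)
    else:
        # Lines 17-19: near-equal sides, then the last dimension.
        side = round(switches ** (1.0 / dim_count))
        last = ceil(Fraction(switches, side ** (dim_count - 1)))
        dims = (side,) * (dim_count - 1) + (last,)
        switches = prod(dims)
    # Line 21: one cable per node plus one per pair of switch-facing ports.
    cables = nodes + (switches * ports_to_switches) // 2
    topology = "ring" if dim_count == 1 else "torus"
    return TorusDesign(topology, dims, switches, resulting_blocking, cables)


for n in (1000, 6000, 8000, 10000, 19000):
    design = design_torus(n)
    print(n, "x".join(map(str, design.dims)), design.switches, design.cables)
```

With the default arguments (36-port switches, blocking factor 1), the loop prints the same layouts as the "Torus topology" column of Table 2.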

5. COST COMPARISON OF TORUS AND 
FAT-TREE NETWORKS 

We used real-life equipment costs provided by Mellanox
Technologies to derive costs of fat-tree and torus networks
for up to 3,888 compute nodes. We used the tool for
automated design of cluster interconnection networks [8].
Equipment costs are given for the older generation of
equipment (InfiniBand QDR), and technical characteristics are
summarized in Table 3. Cable cost is assumed to be $80.

We consider three models of switches. The first of them, the
36-port switch, is used for building torus networks, and is
also utilized on the edge level of fat-tree networks. The
other two are modular switches that have 108 and 216 ports in
their maximal configurations. The actual number of supported
ports depends on the number of installed line cards, which
leads to 6 and 12 configurations of these switches,
respectively. Each configuration has its own set of technical
characteristics as well as cost.

The set of equipment described above allows building
non-blocking fat-tree networks with up to
$N_{max} = P_E \cdot P_C / 2 = 36 \cdot 216 / 2 = 3888$ nodes.
In Fig. 1 we plot costs of non-blocking as well as 2:1
blocking fat-tree networks, and torus networks. As expected,
the cost of 2:1 blocking fat-trees is lower than that of their
non-blocking counterparts, but the reduction in cost is less
than twofold. Torus networks are consistently cheaper than
fat-trees; however, their inherent blocking may have a
detrimental effect on application performance that will not
be offset by lower costs.

We also consider an alternative way of building fat-trees:
using 36-port switches for both core and edge layers. This
allows building non-blocking fat-tree networks with up to
$N_{max} = 36 \cdot 36 / 2 = 648$ nodes. Such networks are
characterized by complex wiring patterns between the two
layers, but are marginally cheaper to build. Fig. 2 is
essentially a close-up of the previous figure, focusing on
values of N up to 648 nodes, with an additional curve
representing costs of the alternative fat-tree building
method.
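
Both size limits follow from the same expression, which can be sketched as a small helper. The function name `fat_tree_max_nodes` is hypothetical; it assumes a two-level non-blocking fat-tree in which every edge switch has one uplink to each core switch.

```python
def fat_tree_max_nodes(edge_ports: int, core_ports: int) -> int:
    """Maximum node count of a two-level non-blocking fat-tree.

    Each edge switch gives half of its ports to nodes and half to uplinks,
    and a core switch can serve at most core_ports edge switches.
    """
    return edge_ports * core_ports // 2


print(fat_tree_max_nodes(36, 216))  # modular 216-port core switches: 3888
print(fat_tree_max_nodes(36, 36))   # 36-port switches on both levels: 648
```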

As the diagram indicates, using 36-port switches for
building fat-trees does indeed lead to certain cost savings:
for N = 648 nodes, the per-port cost of such networks is
roughly $1,060, while for the usual way of building fat-trees,
using modular switches on the core level, the per-port cost is
roughly $1,930. However, these savings should be weighed
against the cost of compute nodes: if the latter is much
higher than the per-port cost of the interconnection
network, then cost savings might not justify the increased
wiring and maintenance complexity of this type of network.

Example. Let us assume the cost of a compute node is
$5,000. If the per-port cost of two types of interconnection
networks is $1,000 and $2,000, respectively, then the per-node
cost of the whole system is $6,000 with the first network and
$7,000 with the second. The second system is therefore more
expensive by a factor of 7000/6000, or roughly 17%, although
its network alone costs twice as much. Factoring in costs of
other equipment, as well as operating expenses, further
dilutes savings.
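
The dilution effect in this example can be reproduced with a short calculation. The function name `relative_system_cost` is hypothetical, and the sketch considers only node and per-port network costs, ignoring other equipment and operating expenses.

```python
def relative_system_cost(node_cost: float, port_cost_a: float,
                         port_cost_b: float) -> float:
    """Ratio of per-node system cost with network B to that with network A."""
    return (node_cost + port_cost_b) / (node_cost + port_cost_a)


# Figures from the example: $5,000 node, $1,000 vs $2,000 per network port.
ratio = relative_system_cost(5000, 1000, 2000)
print(f"{(ratio - 1) * 100:.1f}% more expensive")  # 16.7% more expensive
```

The more expensive the compute node, the closer this ratio gets to 1, and the less the choice of network affects the total.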

Figure 2 is particularly helpful to emphasize the structure
of networks generated by the network design tool [8].
Consider, for example, the case of non-blocking and 2:1
blocking fat-trees, for N = 150 compute nodes. The costs of
these two networks are very close, but their structure is
entirely different, which is summarized in Table 4.

Figure 1: Cost comparison of fat-tree and torus networks

Figure 2: Cost comparison of alternative fat-tree building methods


If the tool is requested to design a non-blocking network, 
it chooses a star topology with a single modular switch. If, 
however, a 2:1 blocking network is requested, the result is 
a two-layer fat-tree, with 36-port switches on the edge level 
and a 90-port switch on the core level. The latter network is 
chosen because it is marginally (5%) cheaper. At the same 
time, it draws 85% more power and requires 40% more space 
in the rack. 

This example illustrates two points: (A) more complex cri- 
terion functions, such as total cost of ownership, should 
preferably be used instead of capital costs; (B) trying to 
design blocking networks doesn't necessarily save consider- 
able amounts of money, therefore designers should consider 
non-blocking networks first. 

6. CONCLUSIONS

We presented a simple algorithm for automated design of
torus networks. The algorithm relies on a heuristic to choose
the number of torus dimensions. We also compared real-life
costs of torus and fat-tree networks. We found that torus
networks are consistently cheaper than non-blocking and 2:1
blocking fat-trees; however, these cost savings may not offset
performance penalties, depending on the applications used.